A Hybrid Model for Detection and Elimination of Near-Duplicates Based on Web Provenance for Effective Web Search

Authors

  • Tanvi Gupta
  • Latha Banda
Abstract

Users of the World Wide Web rely on search engines for information retrieval, as search engines play a vital role in finding information on the web. However, the sheer volume of web documents has weakened the performance and reliability of web search engines. The existence of near-duplicate data is an issue that accompanies the growing need to incorporate heterogeneous data. Near-duplicate pages increase either the index storage space or the serving costs, thereby irritating users. Near-duplicate detection has been recognized as an important problem in plagiarism detection, spam detection, and focused web crawling. Such near-duplicates can be detected and eliminated using the concept of Web Provenance and the TDW matrix algorithm. The proposed work is a model that combines content, context, semantic structure, and trust-based factors to classify search results as original or near-duplicate and to eliminate the near-duplicates.
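As a rough illustration of what content-based near-duplicate detection can look like, the sketch below compares two documents by the Jaccard similarity of their word shingles. This is a generic, widely used technique and is not the TDW matrix algorithm or the provenance model named in the abstract; the function names, the shingle size `k`, and the `threshold` value are assumptions chosen for the example.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased, whitespace-split)."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle holding all of their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, k=3, threshold=0.8):
    """Flag two documents as near-duplicates when shingle overlap reaches the threshold.

    The threshold of 0.8 is an illustrative assumption, not a value from the paper.
    """
    return jaccard(shingles(doc_a, k), shingles(doc_b, k)) >= threshold
```

A content-only measure like this covers just one of the four factors the abstract lists; context, semantic structure, and trust would need their own signals.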


Similar resources

Near-Duplicates Detection and Elimination Based on Web Provenance for Effective Web Search

Users of the World Wide Web rely on search engines for information retrieval, as search engines play a vital role in finding information on the web. However, the performance of a web search is greatly affected by the flooding of search results with redundant information, i.e., the existence of near-duplicates. Such near-duplicates hold up the other promising results to the users. Many ...


An Ensemble Click Model for Web Document Ranking

Every year, web search engine providers spend more and more money on document ranking in search engine result pages (SERPs). Click models provide useful information for ranking documents in SERPs by modeling interactions between users and search engines. Here, three modules are employed to create a hybrid click model; the first module is a PGM-based click model, the second module in a d...


Provenance Based Web Search

During web search, we often end up with untrusted, duplicate, and near-duplicate search results, which dilute the focus of the search query. Factors that may influence the trust of web search results shall be referred to as 'Provenance'. Provenance is basically the information about the history of data. In this paper, we propose a provenance model which uses both content based and trust based facto...


A New Hybrid Method for Web Pages Ranking in Search Engines

There are many algorithms for optimizing search engine results; ranking takes place according to one or more parameters such as backward links, forward links, content, and click-through rate. The quality and performance of these algorithms depend on the listed parameters. Ranking is one of the most important components of the search engine that represents the degree of the vitality...


Web-Scale Near-Duplicate Search: Techniques and Applications

As the bandwidth accessible to average users has increased, audiovisual material has become the fastest growing datatype on the Internet. The impressive growth of the social Web, where users can exchange user-generated content, contributes to the overwhelming number of multimedia files available. Among these huge volumes of data, a large number of near duplicates and copies exist. File copies...




Publication date: 2012